Introduction

Accurately predicting home prices is essential for stakeholders, such as buyers, investors, and policymakers, as it includes important financial decisions, urban planning, and policy making initiatives. However, home price predictions remain a difficult tasks, as they are influenced by a wide range of factors, including the housing characteristics, neighborhood amenities, and the overall neighborhood economic vibration. In this analysis, I build a predictive model for home prices in Charlotte, NC, using a dataset of housing information in Mecklenburg County, NC. The model will be based on the internal and external or spatial factors to predict the housing prices. The internal factors may include some housing basic characteristics, like footage, number of bedrooms, etc. The external factors may include the crime rate, median household income, the coverage of the public transportation,and access to the park and public spaces. By identifying and engineering such factors, our predictive model aims to minimize the prediction errors and provides deeper insights into the housing market in Charlotte, NC.

My analytical approach start with loading and cleaning the providing existing housing datasets. I first explored the datasets and identify the missing values and potential outliers. Then, utilizing the feature engineering process, I created new features from the existing features to improve the model performance, including the home ages and the dummy variables for the air conditioning and heating system. Additionally, spatial data and features were gathered from the American Community Survey, and open data portal, including crimes data, proximity to recreational spaces, and tree canopy. Following data preparation, I partitioned the data into training and testing subsets using an 80-20 split to reliably assess the model’s predictive performance. The model was developed using Ordinary Least Squares (OLS) regression and tested across various combinations of variables. The final model was evaluated using diagnostic metrics (MAE, RMSE, R-squared) and spatial residual analyses, including Moran’s I test, to detect spatial autocorrelation in residuals. The analysis concludes by addressing model limitations and potential to improvements.

Data Collection & Exploring

In this section, I clean the existing housing datasets and explore other sources of data that are important for the model building process. Second, I check the correlation between the variables to identify the multicollinearity between the variables, the linearity relationship between key predictors and the dependent variable, and the visual sense of autocorrelation in the data.

Step 1: Load and Clean the Data

The first step for this analysis start by loading the household data of the Charlotte, NC area. The data is in the geojson format with over 46,000 records of the housing information in Mecklenburg County, NC.

house<- st_read("data/studentData.geojson")

The original dataset contain many information that are not closely related with the housing prices, such as zip code, owner’s name, tax code. I filtered the datasets that only contains information about the price, bedrooms, yearbuilt, fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade, units, heatedfuel,and actype of the houses. The data is loaded and cleaned by removing the missing values and the houses with price less than 0. For prediction purposes, the price equal or below 0 is not useful.

house<- house %>%
  dplyr::select(price, bedrooms,yearbuilt,fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade,units, heatedfuel, actype)  %>%
  filter(!is.na(price) & !is.na(bedrooms) & !is.na(yearbuilt) & !is.na(fullbaths) & !is.na(halfbaths) & !is.na(heatedarea) & !is.na(storyheigh) & !is.na(numfirepla) & !is.na(bldggrade) & !is.na(units) & !is.na(heatedfuel) & !is.na(actype))%>%
  filter(price>0)

After checking housing prices, a small number of homes in Charlotte have exceptionally high values, representing clear outliers. These extreme values could significantly distort our model’s ability to accurately predict typical housing prices. Therefore, I have excluded all homes valued above 10 million dollars to ensure the robustness and accuracy of our predictive model.

house<- house %>%
  filter(price<10000000)

After the primary filtering, the dataset contains about 35,000 records of housing information in Mecklenburg County, NC.

Check the skewness of the variables

For the prediction purpose, the normal distribution of the variable matters because it helps ensure the accuracy and reliability of the model. When the variable is normally distributed, it usually means that the erroes or residuals are also normally distributed.

The histogram of the price shows that the price is right skewed. To make the price more normally distributed, I took the log transformation of the price. The histogram of the log price shows that the log price is more normally distributed than the original price. I used the log price as the dependent variable in the model building process.

ggplot(house, aes(x = price)) +
  geom_histogram(fill = "lightblue", bins= 500) +
  labs(title = "Histogram of Price", x = "Price", y = "Frequency")

house <- house %>%
  mutate(logprice = log(price))

ggplot(house, aes(x = logprice)) +
  geom_histogram(fill = "lightblue", bins= 500) +
  labs(title = "Histogram of Log Price", x = "Log Price", y = "Frequency")

Step 2: Feature engineering

The feature engineering is the process of creating new features from the existing features. In this analysis, I created the age of the house, from the housing built year. I also created the dummy variables for the air conditioning and heating system, and the categorical variable housing type. The dummy variables for the air conditioning and heating system are created by checking if the house has air conditioning or heating system, as housing with air conditioning and heating system tends to have higher price. The type variable is created by categorizing the house as mansion, castle and normal house. The house has more than 5000 square feet of heated area, more than 3 bedrooms, and more than 3 full baths ia categorized as mansion. The house has more than 10000 square feet of heated area, more than 5 bedrooms, and more than 5 full baths is categorized as castle. The rest of the houses are categorized as normal house.

The building grade is converted to a numeric variable by assigning a number to each category based on factors. The new features are created to improve the model performance.

Grade Description Typical Examples Quality & Value Level
Minimum Very basic, minimal standards,limited durability. Old sheds, simple outbuildings, deteriorating structures Lowest
Fair Below-average condition, noticeable wear or deferred maintenance. Older homes/apartments with deferred maintenance, structures needing renovation Below Average
Average Standard construction quality, functional and maintained condition. Typical single-family homes, standard apartment units Moderate
Good Above-average quality construction, good condition. Newer suburban homes, renovated units, higher-quality commercial buildings Above Average
Excellent Superior quality, exceptional finishes, and meticulous attention to detail. High-end residences, luxury condominiums, upscale commercial properties High
Custom Unique or specialized construction, highly personalized design,typically luxury or architecturally significant. Luxury custom homes, unique historical restorations, architect-designed high-end properties Highest or Specialized

Source: https://localdocs.charlotte.edu/Tax_Collections/Reports_Studies/

house <- house %>%
  mutate(
    age = 2022 - yearbuilt,
    storyheigh = as.numeric(gsub("[^0-9.]", "", storyheigh)),
    bldggrade = case_when(
      bldggrade == "MINIMUM" ~ 1,
      bldggrade == "FAIR" ~ 2,
      bldggrade == "AVERAGE" ~ 3,
      bldggrade == "GOOD" ~ 4,
      bldggrade == "VERY GOOD" ~ 5,
      bldggrade == "EXCELLENT" ~ 6,
      bldggrade == "CUSTOM" ~ 7),
    ac_dummy = ifelse(actype == "AC-NONE", 0, 1),
    heat_dummy = ifelse(heatedfuel == "NONE", 0, 1),
    mansion= ifelse(heatedarea>5000 & heatedarea<10001 & bedrooms>2 & bedrooms<6 & fullbaths>2, 1, 0),
    castle= ifelse(heatedarea>10000 & bedrooms>5 & fullbaths>5, 1, 0),
    house= ifelse(mansion==0 & castle==0, 1, 0),
    heatfuel = ifelse(heatedfuel == "NONE", NA, heatedfuel)
    )

house <- house %>%
  dplyr::select(-yearbuilt, -actype, -heatedfuel)%>%
  mutate(type= case_when(
    house==1 ~ "House",
    mansion==1 ~ "Mansion",
    castle==1 ~ "Castle"
  ))

Additional spatial data to predict the housing price

Many other factors can affect the housing price that not contains in the given datasets. Many other spatial variables affect the housing price and value as well. For example, the housing values are closely correlate with the crime rates in the neighborhood, median household income and whether the house is close to the public transportation and access to the park and public spaces. All those variables should be included in the prediction model to improve the model performance.

The additional spatial data is collected from American Community Survey, City of Charlotte open data portal and other sources. The additional spatial data includes the crime rate, the distance to the public transportation, the distance to the park. The additional spatial data is collected and cleaned to be used in the model building process.

Median household income data was collected from the American Community Survey at the census block group level. The median household income values were then assigned to individual housing data points based on their geographic location.

income<- get_acs(
  geography = "block group",
  variables = c("income"="B19013_001"),
  state = "NC",
  county = "Mecklenburg",
  geometry = TRUE
  )%>%
  st_transform(crs =st_crs(house))%>%
  filter(!is.na(estimate))%>%
  dplyr::select(estimate, GEOID, geometry)%>%
  rename(income=estimate)

#spatial join the income data to the house data
house <- st_join(house, income)%>%
  filter(!is.na(income))

Crime data was collected from the City of Charlotte open data portal from police station. And then, the crime data is aggregated at the census block group level same as the household income data. Finally, the crime data was spatially joined to the house data based on the house geolocation.

crime<- read.csv("data/CMPD.csv")%>%
  filter(YEAR>=2021)

crime<- crime %>%
  dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC, YEAR)%>%
  st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = 4326)%>%
  st_transform(crime, crs =st_crs(house))

crime_tract<- st_join(house, crime)%>%
  group_by(GEOID)%>%
  summarise(crime=n())

house <- st_join(house, crime_tract)%>%
  dplyr::select(-GEOID.x, -GEOID.y)%>%
  filter(!is.na(crime))

Public transportation coverage data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units located within 0.5 miles of a transit stop at the neighborhood level as defined by Mecklenburg County. The neighborhood-level transit coverage values were then assigned to individual housing data points based on their geographic location.

transit <- st_read("data/transportation.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(transit=X2023)

#spatial join the transit data to the house data
house <- st_join(house, transit)%>%
  filter(!is.na(transit))

Tree Canopy data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset represents the percentage of tree cover in each neighborhood. Like the transit coverage data, these values are linked to the housing data based on each property’s geolocation.

tree <- st_read("data/tree_canopy.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(canopy=X2023)

#spatial join the tree canopy data to the house data
house <- st_join(house, tree)

Approximate distance to the park data was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units within 0.5 miles of an outdoor public recreation area at the neighborhood level. Similar to other spatial datasets, the park accessibility data was assigned to individual housing records based on their geographic locations.

recreation<- st_read("data/recreation.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(recreation=X2023)

#spatial join the recreation data to the house data
house <- st_join(house, recreation)%>%
  filter(!is.na(recreation))

The potential drawback of these approaches is that the data are aggregated at large geographic units rather than at the individual household level. Households within the same census block group or neighborhood may differ in their access to transit and other services. Additionally, households within the same neighborhood may vary significantly in income levels. Despite these limitations, the aggregated data represent the best available source for use in the model-building process.

Step 3: Summary Statistics and Correlation Matrix

The summary statistics and correlation matrix are calculated to understand the relationship between the variables. The summary statistics show the mean and standard deviation (SD) of the variables. The correlation matrix shows the correlation between the variables. The correlation matrix is used to understand the relationship between the variables and to identify the multicollinearity between the variables. Lastly, various scatter plots are created to visualize the relationship between the key predictors and the dependent variable to ensure the linearity relationship between the predictors and the dependent variable.

Summary Statistics

For the summary statistics, I calculated the mean and standard deviation of the variables. The summary statistics are calculated to understand the distribution of the variables. The summary statistics are used to understand the distribution of the variables and to identify the outliers in the data.

dependent_var <- "logprice"

predictors <- c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "income", "crime", "transit", "canopy", "recreation")

summary_stats <- house %>%
  st_drop_geometry() %>%
  dplyr::select(all_of(c(dependent_var, predictors))) %>%
  summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
  pivot_wider(names_from = Stat, values_from = Value)



summary_stats <- summary_stats %>%
  mutate(Variable = case_when(
    Variable == "logprice" ~ "Log transformed House Value",
    Variable == "bedrooms" ~ "Number of bedrooms",
    Variable == "fullbaths" ~ "Number of full baths",
    Variable == "halfbaths" ~ "Number of half baths",
    Variable == "heatedarea" ~ "House Square footage",
    Variable == "storyheigh" ~ "Number of stories",
    Variable == "numfirepla" ~ "Number of fireplaces",
    Variable == "bldggrade" ~ "Building Grade",
    Variable == "units" ~ "Number of units",
    Variable == "age" ~ "Age of the house",
    Variable == "income" ~ "Median household income",
    Variable == "crime" ~ "Number of Crime in Census Block group",
    Variable == "transit" ~ "% Transit Coverage",
    Variable == "canopy" ~ "% Tree Canopy Coverage",
    Variable == "recreation" ~ "% Recreation Coverage",
    TRUE ~ Variable
  ))




summary_stats <- summary_stats %>%
  mutate(
    Mean = round(Mean, 2),
    SD = round(SD, 2)
  )

summary_stats <- summary_stats %>%
  arrange(Variable == "Log transformed House Value")

predictor_rows <- which(summary_stats$Variable != "Log transformed House Value")
dependent_rows <- which(summary_stats$Variable == "Log transformed House Value")

# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred   <- max(predictor_rows)
start_dep  <- min(dependent_rows)
end_dep    <- max(dependent_rows)

# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
      align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
  add_header_above(c(" " = 1, "Statistics" = 2)) %>%
  kable_styling(full_width = FALSE) %>%
  group_rows("Predictors", start_pred, end_pred) %>%
  group_rows("Dependent Variable", start_dep, end_dep)%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)
Summary Statistics
Statistics
Variable Mean SD
Predictors
Number of bedrooms 3.52 0.91
Number of full baths 2.29 0.83
Number of half baths 0.60 0.53
House Square footage 2366.73 1057.49
Number of stories 1.65 0.48
Number of fireplaces 0.79 0.47
Building Grade 3.43 0.78
Number of units 0.98 0.19
Age of the house 28.26 24.04
Median household income 111240.12 50146.79
Number of Crime in Census Block group 156.21 144.64
% Transit Coverage 52.13 39.83
% Tree Canopy Coverage 50.31 14.57
% Recreation Coverage 55.12 33.92
Dependent Variable
Log transformed House Value 12.88 0.54

Correlation Matrix

predictor_vars <- house[, c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()


cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")

ggcorrplot(cor_matrix,
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("#a4133c", "white", "#a4133c"))+
    labs(title = "Correlation Matrix for all Predictor Variables") +
    theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))

Since there are high correlation between the heated area and full baths, I will remove the full baths from the model building process. The full baths is removed to avoid the multicollinearity between the variables. Similarly, the number of bedrooms and heated area are highly correlated, I will remove the number of bedroom a from the model building process. Storyheight is also removed for the same reason.

After removing, the final correlation matrix is following:

predictor_vars <- house[, c( "halfbaths", "heatedarea", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()


cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")

ggcorrplot(cor_matrix,
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("#a4133c", "white", "#a4133c"))+
    labs(title = "Correlation Matrix for all Predictor Variables") +
    theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))

house<- house %>%
  dplyr::select(-fullbaths, -storyheigh, -bedrooms)

Check the VIF

To ensure the multicollinearity between the variables, I checked the Variance Inflation Factor (VIF) of the variables. The VIF values are calculated to further check the multicollinearity between the variables. The VIF values are used to identify the multicollinearity between the variables. The VIF values above 5 indicate high multicollinearity between the variables. The VIF values are calculated to ensure the multicollinearity between the variables. The VIF values for all the variables in the model are below 3, which indicates that there is no multicollinearity between the variables.

model<- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = house)

# Compute VIF values
vif_values <- vif(model)

# Convert the VIF values to a data frame
vif_df <- data.frame(Variable = names(vif_values),
                     VIF = as.vector(vif_values))

# Display the VIF table using knitr::kable
knitr::kable(vif_df, caption = "Variance Inflation Factors for the Model")
Variance Inflation Factors for the Model
Variable VIF
halfbaths 1.199884
heatedarea 2.448252
numfirepla 1.249302
bldggrade 2.023239
units 1.063245
age 1.727478
ac_dummy 1.176752
heat_dummy 1.037838
income 1.784013
crime 1.604203
transit 1.942375
canopy 1.382799
recreation 1.364279

Scatterplot

longer<-house %>%
  pivot_longer(cols = c(
    "heatedarea",
    "age",
    "crime",
    "transit"),
               names_to = "Variable",
               values_to = "Value")%>%
  st_drop_geometry()

ggplot(longer,aes(x = Value, y = price)) +
  geom_point(color = "black", size= 0.4) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c("heatedarea" = "Heated Area",
                                                                  "age" = "Age of the House",
                                                                  "crime" = "Number of Crime",
                                                                  "transit" = "Transit Coverage"
  )))  +
  theme_light() +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6),
        axis.title=element_text(size=8)) +
  labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
       x = "Value",
       y = "House price")

Step 4: Mapping

Dependent Variable- Housing price

map<-st_join(house, income)%>%
  group_by(GEOID)%>%
  summarise(price=mean(price))%>%
  st_drop_geometry()

map<- left_join(income, map, by="GEOID")%>%
  filter(!is.na(price))


ggplot(map) +
  geom_sf(aes(fill= q5(price)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(map, 'price')), 0),
                     name = 'Housing price') +
  labs(title = "Average House Price by Census Block Group",
       fill = "Price") +
    theme(
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 10, face = "italic"),
        plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Key predictors

Q1<-ggplot(map) +
  geom_sf(aes(fill= q5(income)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(map, 'income')), 0),
                     name = 'Income') +
  labs(title = "Average Household Income by Census Block Group",
       fill = "Income") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Q2<-ggplot(tree) +
  geom_sf(aes(fill= q5(canopy)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
                     name = '% of Tree Coverage') +
  labs(title = "Tree Canopy by Neighborhood",
       fill = "% of Tree Coverage") +
    theme(
        legend.text = element_text(size =4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

recreation<-recreation%>% filter(!is.na(recreation))
Q3<-ggplot(recreation) +
  geom_sf(aes(fill= q5(recreation)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
                     name = '% of Recreation Coverage') +
  labs(title = "Recreation facilities coverage by Neighborhood",
       fill = "% of Recreational facilities Coverage") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

transit<-transit%>% filter(!is.na(transit))

Q4<-ggplot(transit) +
  geom_sf(aes(fill= q5(transit)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(transit, 'transit')), 0),
                     name = '% of Transit Coverage') +
  labs(title = "Public Transit coverage by Neighborhood",
       fill = "% of Public Transit Coverage") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

# Arrange the plots in a 2x2 grid
grid.arrange(Q1, Q2, Q3, Q4, ncol = 2)

Model Building & Evaluation

Train-Test Split

In this model building process, I spilt the training and testing datasets based on 80% training and 20% testing to ensure the model have a better predictivity

# Set the seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
train_index <- createDataPartition(house$price, p = 0.8, list = FALSE)

train <- house[train_index, ]
test <- house[-train_index, ]

Model Building and Evaluation

Initial model

model1 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation, data = train)

test<-test%>%
  mutate(predictions = predict(model1, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
Model Performance Metrics
Metric Value
Mean Absolute Error (MAE) 0.21552
Root Mean Squared Error (RMSE) 0.30249
Test R-squared 0.68026

Include the dummy variables

model2 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = train)

test<-test%>%
  mutate(predictions = predict(model2, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
Model Performance Metrics
Metric Value
Mean Absolute Error (MAE) 0.21518
Root Mean Squared Error (RMSE) 0.30156
Test R-squared 0.68221

Include the Categorical Variables

model3 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation + type, data = train)

test<-test%>%
  mutate(predictions = predict(model3, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
Model Performance Metrics
Metric Value
Mean Absolute Error (MAE) 0.21529
Root Mean Squared Error (RMSE) 0.30205
Test R-squared 0.68118

Final model

model4 <- lm(logprice ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)
summary(model4)
## 
## Call:
## lm(formula = logprice ~ heatedarea + numfirepla + bldggrade + 
##     units + age + income + crime + transit + canopy + type + 
##     ac_dummy + heat_dummy, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4978 -0.1626  0.0106  0.1760  3.1461 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept) 10.36348279816  0.16726830837  61.957 < 0.0000000000000002 ***
## heatedarea   0.00021952749  0.00000288363  76.129 < 0.0000000000000002 ***
## numfirepla   0.05547169198  0.00450513909  12.313 < 0.0000000000000002 ***
## bldggrade    0.23680234526  0.00343081975  69.022 < 0.0000000000000002 ***
## units       -0.08285680814  0.00993145730  -8.343 < 0.0000000000000002 ***
## age          0.00174947866  0.00010033112  17.437 < 0.0000000000000002 ***
## income       0.00000230146  0.00000004984  46.176 < 0.0000000000000002 ***
## crime       -0.00012404294  0.00001618311  -7.665   0.0000000000000185 ***
## transit      0.00095080578  0.00006181952  15.380 < 0.0000000000000002 ***
## canopy      -0.00263274567  0.00015069211 -17.471 < 0.0000000000000002 ***
## typeHouse    0.87181193077  0.15862322298   5.496   0.0000000391595998 ***
## typeMansion  0.75267648694  0.15851853081   4.748   0.0000020624871705 ***
## ac_dummy     0.19300153324  0.01130432411  17.073 < 0.0000000000000002 ***
## heat_dummy  -0.03969739237  0.05069505071  -0.783                0.434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3144 on 28242 degrees of freedom
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6644 
## F-statistic:  4305 on 13 and 28242 DF,  p-value: < 0.00000000000000022
test<-test%>%
  mutate(predictions = predict(model4, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
Model Performance Metrics
Metric Value
Mean Absolute Error (MAE) 0.21505
Root Mean Squared Error (RMSE) 0.30126
Test R-squared 0.68285

model diagnostics

par(mfrow = c(2, 2))
plot(model4)

Compare with non log transformed price model

model5 <- lm(price ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)

par(mfrow = c(2, 2))
plot(model5)

Spatial Analysis & Residual Diagnostics

model_final<-lm(logprice ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = house)

fitted_values <- fitted(model_final)
residuals <- residuals(model_final)

house<- house %>%
  mutate(fitted = fitted_values,
         residuals = residuals)

ggplot(house, aes(x = fitted, y = logprice)) +
  geom_point(color = "black", size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Predicted Values vs. Observed Values",
       x = "Observed Values",
       y = "Predicted Values") +
  theme_light() +
  theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))

residual_map<-st_join(income, house)%>%
  group_by(GEOID)%>%
  summarise(residual=mean(residuals))%>%
  filter(!is.na(residual))

ggplot(residual_map) +
  geom_sf(aes(fill= q5(residual)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(residual_map, 'residual')), 6),
                     name = 'Residuals') +
  labs(title = "Residuals Spatial Distribution",
       fill = "Residuals") +
    theme(
        legend.text = element_text(size =8),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 8, face = "italic"),
        plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(test) 

# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))

# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")

moran <- moran.mc(test$error, spatialWeights, nsim = 999)

ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(house) 

# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))

# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")

moran <- moran.mc(house$residuals, spatialWeights, nsim = 999)

ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

Conclusion

---
title: 'MUSA 5080 Midterm: Predictive model for home prices in Charlotte, NC'
author: "Zhanchao Yang"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: journal
    highlight: tango
    toc: true
    code_folding: hide
    code_download: yes
    toc_float:
      collapsed: true

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(scipen = 999)
library(tidyverse)
library(caret)
library(ggcorrplot)
library(sf)
library(kableExtra)
library(tidycensus)
library(tidyverse)
library(broom)
library(ggplot2)
library(gridExtra)
library(car)
library(caret)
library(MASS)
library(tidyverse)
library(spdep) #for spatial autocorrelation
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(kableExtra)
library(tidycensus)
library(RColorBrewer)
```

```{r, include=FALSE}
# Map element
q5 <- function(variable) {as.factor(ntile(variable, 5))}
flatreds5 <- c("#0081a7",  "#00afb9", "#fdfcdc", "#fed9b7", "#f07167")
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],3),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}
```

# Introduction
Accurately predicting home prices is essential for stakeholders, such as buyers, investors, and policymakers, as it includes important financial decisions, urban planning, and policy making initiatives. However, home price predictions remain a difficult tasks, as they are influenced by a wide range of factors, including the housing characteristics, neighborhood amenities, and the overall neighborhood economic vibration. In this analysis, I build a predictive model for home prices in Charlotte, NC, using a dataset of housing information in Mecklenburg County, NC. The model will be based on the internal and external or spatial factors to predict the housing prices. The internal factors may include some housing basic characteristics, like footage, number of bedrooms, etc. The external factors may include the crime rate, median household income, the coverage of the public transportation,and access to the park and public spaces. By identifying and engineering such factors, our predictive model aims to minimize the prediction errors and provides deeper insights into the housing market in Charlotte, NC.

My analytical approach start with loading and cleaning the providing existing housing datasets. I first explored the datasets and identify the missing values and potential outliers. Then, utilizing the feature engineering process, I created new features from the existing features to improve the model performance, including the home ages and the dummy variables for the air conditioning and heating system. Additionally, spatial data and features were gathered from the American Community Survey, and open data portal, including crimes data, proximity to recreational spaces, and tree canopy. Following data preparation, I partitioned the data into training and testing subsets using an 80-20 split to reliably assess the model's predictive performance. The model was developed using Ordinary Least Squares (OLS) regression and tested across various combinations of variables. The final model was evaluated using diagnostic metrics (MAE, RMSE, R-squared) and spatial residual analyses, including Moran's I test, to detect spatial autocorrelation in residuals. The analysis concludes by addressing model limitations and potential to improvements.

# Data Collection & Exploring
In this section, I clean the existing housing datasets and explore other sources of data that are important for the model building process. Second, I check the correlation between the variables to identify the multicollinearity between the variables, the linearity relationship between key predictors and the dependent variable, and the visual sense of autocorrelation in the data.

## Step 1: Load and Clean the Data
The first step for this analysis start by loading the household data of the Charlotte, NC area. The data is in the geojson format with over 46,000 records of the housing information in Mecklenburg County, NC.

```{r, results='hide'}
house<- st_read("data/studentData.geojson")
```

The original dataset contain many information that are not closely related with the housing prices, such as zip code, owner's name, tax code. I filtered the datasets that only contains information about the price, bedrooms, yearbuilt, fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade, units, heatedfuel,and actype of the houses. The data is loaded and cleaned by removing the missing values and the houses with price less than 0. For prediction purposes, the price equal or below 0 is not useful.

```{r}
house<- house %>%
  dplyr::select(price, bedrooms,yearbuilt,fullbaths, halfbaths, heatedarea, storyheigh, numfirepla, bldggrade,units, heatedfuel, actype)  %>%
  filter(!is.na(price) & !is.na(bedrooms) & !is.na(yearbuilt) & !is.na(fullbaths) & !is.na(halfbaths) & !is.na(heatedarea) & !is.na(storyheigh) & !is.na(numfirepla) & !is.na(bldggrade) & !is.na(units) & !is.na(heatedfuel) & !is.na(actype))%>%
  filter(price>0)
```

After checking housing prices, a small number of homes in Charlotte have exceptionally high values, representing clear outliers. These extreme values could significantly distort our model's ability to accurately predict typical housing prices. Therefore, I have excluded all homes valued above 10 million dollars to ensure the robustness and accuracy of our predictive model.

```{r}
house<- house %>%
  filter(price<10000000)
```

After the primary filtering, the dataset contains about 35,000 records of housing information in Mecklenburg County, NC.

### Check the skewness of the variables

For the prediction purpose, the normal distribution of the variable matters because it helps ensure the accuracy and reliability of the model. When the variable is normally distributed, it usually means that the erroes or residuals are also normally distributed.

The histogram of the price shows that the price is right skewed. To make the price more normally distributed, I took the log transformation of the price. The histogram of the log price shows that the log price is more normally distributed than the original price. I used the log price as the dependent variable in the model building process.

```{r, fig.width=8, fig.height=6}
ggplot(house, aes(x = price)) +
  geom_histogram(fill = "lightblue", bins= 500) +
  labs(title = "Histogram of Price", x = "Price", y = "Frequency")
```
```{r, fig.width=8, fig.height=6}
house <- house %>%
  mutate(logprice = log(price))

ggplot(house, aes(x = logprice)) +
  geom_histogram(fill = "lightblue", bins= 500) +
  labs(title = "Histogram of Log Price", x = "Log Price", y = "Frequency")

```

## Step 2: Feature engineering

The feature engineering is the process of creating new features from the existing features.  In this analysis, I created the age of the house, from the housing built year. I also created the dummy variables for the air conditioning and heating system, and the categorical variable housing `type`. The dummy variables for the air conditioning and heating system are created by checking if the house has air conditioning or heating system, as housing with air conditioning and heating system tends to have higher price. The type variable is created by categorizing the house as mansion, castle and normal house. The house has more than 5000 square feet of heated area, more than 3 bedrooms, and more than 3 full baths ia categorized as mansion. The house has more than 10000 square feet of heated area, more than 5 bedrooms, and more than 5 full baths is categorized as castle. The rest of the houses are categorized as normal house.

The building grade is converted to a numeric variable by assigning a number to each category based on factors. The new features are created to improve the model performance.


| Grade        | Description                                                     | Typical Examples                                      | Quality & Value Level     |
|--------------|-----------------------------------------------------------------|-------------------------------------------------------|---------------------------|
| **Minimum**  | Very basic, minimal standards,limited durability. | Old sheds, simple outbuildings, deteriorating structures | **Lowest**                |
| **Fair**     | Below-average condition, noticeable wear or deferred maintenance. | Older homes/apartments with deferred maintenance, structures needing renovation | **Below Average**         |
| **Average**  | Standard construction quality, functional and maintained condition. | Typical single-family homes, standard apartment units | **Moderate**              |
| **Good**     | Above-average quality construction, good condition. | Newer suburban homes, renovated units, higher-quality commercial buildings | **Above Average**         |
| **Excellent**| Superior quality, exceptional finishes, and meticulous attention to detail. | High-end residences, luxury condominiums, upscale commercial properties | **High**                  |
| **Custom**   | Unique or specialized construction, highly personalized design,typically luxury or architecturally significant. | Luxury custom homes, unique historical restorations, architect-designed high-end properties | **Highest or Specialized** |

Source: https://localdocs.charlotte.edu/Tax_Collections/Reports_Studies/


```{r}
house <- house %>%
  mutate(
    age = 2022 - yearbuilt,
    storyheigh = as.numeric(gsub("[^0-9.]", "", storyheigh)),
    bldggrade = case_when(
      bldggrade == "MINIMUM" ~ 1,
      bldggrade == "FAIR" ~ 2,
      bldggrade == "AVERAGE" ~ 3,
      bldggrade == "GOOD" ~ 4,
      bldggrade == "VERY GOOD" ~ 5,
      bldggrade == "EXCELLENT" ~ 6,
      bldggrade == "CUSTOM" ~ 7),
    ac_dummy = ifelse(actype == "AC-NONE", 0, 1),
    heat_dummy = ifelse(heatedfuel == "NONE", 0, 1),
    mansion= ifelse(heatedarea>5000 & heatedarea<10001 & bedrooms>2 & bedrooms<6 & fullbaths>2, 1, 0),
    castle= ifelse(heatedarea>10000 & bedrooms>5 & fullbaths>5, 1, 0),
    house= ifelse(mansion==0 & castle==0, 1, 0),
    heatfuel = ifelse(heatedfuel == "NONE", NA, heatedfuel)
    )

house <- house %>%
  dplyr::select(-yearbuilt, -actype, -heatedfuel)%>%
  mutate(type= case_when(
    house==1 ~ "House",
    mansion==1 ~ "Mansion",
    castle==1 ~ "Castle"
  ))
```


### Additional spatial data to predict the housing price

Many other factors can affect the housing price that not contains in the given datasets. Many other spatial variables affect the housing price and value as well. For example, the housing values are closely correlate with the crime rates in the neighborhood, median household income and whether the house is close to the public transportation and access to the park and public spaces. All those variables should be included in the prediction model to improve the model performance.

The additional spatial data is collected from American Community Survey, City of Charlotte open data portal and other sources. The additional spatial data includes the crime rate, the distance to the public transportation, the distance to the park. The additional spatial data is collected and cleaned to be used in the model building process.

**Median household income data** was collected from the American Community Survey at the census block group level. The median household income values were then assigned to individual housing data points based on their geographic location.

```{r, results='hide', warning=FALSE, message=FALSE}
income<- get_acs(
  geography = "block group",
  variables = c("income"="B19013_001"),
  state = "NC",
  county = "Mecklenburg",
  geometry = TRUE
  )%>%
  st_transform(crs =st_crs(house))%>%
  filter(!is.na(estimate))%>%
  dplyr::select(estimate, GEOID, geometry)%>%
  rename(income=estimate)

#spatial join the income data to the house data
house <- st_join(house, income)%>%
  filter(!is.na(income))
```

**Crime data** was collected from the City of Charlotte open data portal from police station. And then, the crime data is aggregated at the census block group level same as the household income data. Finally, the crime data was spatially joined to the house data based on the house geolocation.

```{r}
crime<- read.csv("data/CMPD.csv")%>%
  filter(YEAR>=2021)

crime<- crime %>%
  dplyr::select(LATITUDE_PUBLIC, LONGITUDE_PUBLIC, YEAR)%>%
  st_as_sf(coords = c("LONGITUDE_PUBLIC", "LATITUDE_PUBLIC"), crs = 4326)%>%
  st_transform(crime, crs =st_crs(house))

crime_tract<- st_join(house, crime)%>%
  group_by(GEOID)%>%
  summarise(crime=n())

house <- st_join(house, crime_tract)%>%
  dplyr::select(-GEOID.x, -GEOID.y)%>%
  filter(!is.na(crime))
```

**Public transportation coverage data** was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units located within 0.5 miles of a transit stop at the neighborhood level as defined by Mecklenburg County. The neighborhood-level transit coverage values were then assigned to individual housing data points based on their geographic location.

```{r, results='hide'}
transit <- st_read("data/transportation.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(transit=X2023)

#spatial join the transit data to the house data
house <- st_join(house, transit)%>%
  filter(!is.na(transit))
```

**Tree Canopy data** was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset represents the percentage of tree cover in each neighborhood. Like the transit coverage data, these values are linked to the housing data based on each property's geolocation.

```{r, results='hide'}
tree <- st_read("data/tree_canopy.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(canopy=X2023)

#spatial join the tree canopy data to the house data
house <- st_join(house, tree)
```

**Approximate distance to the park data** was collected from the Charlotte/Mecklenburg Quality of Life Explorer. This dataset measures the percentage of housing units within 0.5 miles of an outdoor public recreation area at the neighborhood level. Similar to other spatial datasets, the park accessibility data was assigned to individual housing records based on their geographic locations.

```{r, results='hide'}
recreation<- st_read("data/recreation.geojson")%>%
  dplyr::select(X2023, geometry)%>%
  rename(recreation=X2023)

#spatial join the recreation data to the house data
house <- st_join(house, recreation)%>%
  filter(!is.na(recreation))
```
The potential drawback of these approaches is that the data are aggregated at large geographic units rather than at the individual household level. Households within the same census block group or neighborhood may differ in their access to transit and other services. Additionally, households within the same neighborhood may vary significantly in income levels. Despite these limitations, the aggregated data represent the best available source for use in the model-building process.

## Step 3: Summary Statistics and Correlation Matrix

The summary statistics and correlation matrix are calculated to understand the relationship between the variables. The summary statistics show the mean and standard deviation (SD) of the variables. The correlation matrix shows the correlation between the variables. The correlation matrix is used to understand the relationship between the variables and to identify the multicollinearity between the variables. Lastly, various scatter plots are created to visualize the relationship between the key predictors and the dependent variable to ensure the linearity relationship between the predictors and the dependent variable.

### Summary Statistics
For the summary statistics, I calculated the mean and standard deviation of the variables. The summary statistics are calculated to understand the distribution of the variables. The summary statistics are used to understand the distribution of the variables and to identify the outliers in the data.

```{r chart}
dependent_var <- "logprice"

predictors <- c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "income", "crime", "transit", "canopy", "recreation")

summary_stats <- house %>%
  st_drop_geometry() %>%
  dplyr::select(all_of(c(dependent_var, predictors))) %>%
  summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
  pivot_wider(names_from = Stat, values_from = Value)



summary_stats <- summary_stats %>%
  mutate(Variable = case_when(
    Variable == "logprice" ~ "Log transformed House Value",
    Variable == "bedrooms" ~ "Number of bedrooms",
    Variable == "fullbaths" ~ "Number of full baths",
    Variable == "halfbaths" ~ "Number of half baths",
    Variable == "heatedarea" ~ "House Square footage",
    Variable == "storyheigh" ~ "Number of stories",
    Variable == "numfirepla" ~ "Number of fireplaces",
    Variable == "bldggrade" ~ "Building Grade",
    Variable == "units" ~ "Number of units",
    Variable == "age" ~ "Age of the house",
    Variable == "income" ~ "Median household income",
    Variable == "crime" ~ "Number of Crime in Census Block group",
    Variable == "transit" ~ "% Transit Coverage",
    Variable == "canopy" ~ "% Tree Canopy Coverage",
    Variable == "recreation" ~ "% Recreation Coverage",
    TRUE ~ Variable
  ))




summary_stats <- summary_stats %>%
  mutate(
    Mean = round(Mean, 2),
    SD = round(SD, 2)
  )

summary_stats <- summary_stats %>%
  arrange(Variable == "Log transformed House Value")

predictor_rows <- which(summary_stats$Variable != "Log transformed House Value")
dependent_rows <- which(summary_stats$Variable == "Log transformed House Value")

# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred   <- max(predictor_rows)
start_dep  <- min(dependent_rows)
end_dep    <- max(dependent_rows)

# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
      align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
  add_header_above(c(" " = 1, "Statistics" = 2)) %>%
  kable_styling(full_width = FALSE) %>%
  group_rows("Predictors", start_pred, end_pred) %>%
  group_rows("Dependent Variable", start_dep, end_dep)%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)
```

### Correlation Matrix

```{r, fig.width=8, fig.height=6}

predictor_vars <- house[, c("bedrooms", "fullbaths", "halfbaths", "heatedarea", "storyheigh", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()


cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")

ggcorrplot(cor_matrix,
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("#a4133c", "white", "#a4133c"))+
    labs(title = "Correlation Matrix for all Predictor Variables") +
    theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))
```

Since there are high correlation between the heated area and full baths, I will remove the full baths from the model building process. The full baths is removed to avoid the multicollinearity between the variables. Similarly, the number of bedrooms and heated area are highly correlated, I will remove the number of bedroom a from the model building process. Storyheight is also removed for the same reason.

After removing, the final correlation matrix is following:
```{r, fig.width=8, fig.height=6}
predictor_vars <- house[, c( "halfbaths", "heatedarea", "numfirepla", "bldggrade", "units", "age", "ac_dummy", "heat_dummy", "income", "crime", "transit", "canopy", "recreation")]%>%st_drop_geometry()


cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")

ggcorrplot(cor_matrix,
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("#a4133c", "white", "#a4133c"))+
    labs(title = "Correlation Matrix for all Predictor Variables") +
    theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))
house<- house %>%
  dplyr::select(-fullbaths, -storyheigh, -bedrooms)

```

**Check the VIF**

To ensure the multicollinearity between the variables, I checked the Variance Inflation Factor (VIF) of the variables. The VIF values are calculated to further check the multicollinearity between the variables. The VIF values are used to identify the multicollinearity between the variables. The VIF values above 5 indicate high multicollinearity between the variables. The VIF values are calculated to ensure the multicollinearity between the variables. The VIF values for all the variables in the model are below 3, which indicates that there is no multicollinearity between the variables.

```{r}
model<- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = house)

# Compute VIF values
vif_values <- vif(model)

# Convert the VIF values to a data frame
vif_df <- data.frame(Variable = names(vif_values),
                     VIF = as.vector(vif_values))

# Display the VIF table using knitr::kable
knitr::kable(vif_df, caption = "Variance Inflation Factors for the Model")
```

### Scatterplot

```{r, fig.width=8, fig.height=6, warning=FALSE, message=FALSE}
longer<-house %>%
  pivot_longer(cols = c(
    "heatedarea",
    "age",
    "crime",
    "transit"),
               names_to = "Variable",
               values_to = "Value")%>%
  st_drop_geometry()

ggplot(longer,aes(x = Value, y = price)) +
  geom_point(color = "black", size= 0.4) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c("heatedarea" = "Heated Area",
                                                                  "age" = "Age of the House",
                                                                  "crime" = "Number of Crime",
                                                                  "transit" = "Transit Coverage"
  )))  +
  theme_light() +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6),
        axis.title=element_text(size=8)) +
  labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
       x = "Value",
       y = "House price")

```

## Step 4: Mapping

### Dependent Variable- Housing price

```{r, fig.width=10, fig.height=10, warning=FALSE,message=FALSE}
map<-st_join(house, income)%>%
  group_by(GEOID)%>%
  summarise(price=mean(price))%>%
  st_drop_geometry()

map<- left_join(income, map, by="GEOID")%>%
  filter(!is.na(price))


ggplot(map) +
  geom_sf(aes(fill= q5(price)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(map, 'price')), 0),
                     name = 'Housing price') +
  labs(title = "Average House Price by Census Block Group",
       fill = "Price") +
    theme(
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 10, face = "italic"),
        plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
```

### Key predictors




```{r, fig.width=10, fig.height=10}
Q1<-ggplot(map) +
  geom_sf(aes(fill= q5(income)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(map, 'income')), 0),
                     name = 'Income') +
  labs(title = "Average Household Income by Census Block Group",
       fill = "Income") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

Q2<-ggplot(tree) +
  geom_sf(aes(fill= q5(canopy)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
                     name = '% of Tree Coverage') +
  labs(title = "Tree Canopy by Neighborhood",
       fill = "% of Tree Coverage") +
    theme(
        legend.text = element_text(size =4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

recreation<-recreation%>% filter(!is.na(recreation))
Q3<-ggplot(recreation) +
  geom_sf(aes(fill= q5(recreation)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(tree, 'canopy')), 0),
                     name = '% of Recreation Coverage') +
  labs(title = "Recreation facilities coverage by Neighborhood",
       fill = "% of Recreational facilities Coverage") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

transit<-transit%>% filter(!is.na(transit))

Q4<-ggplot(transit) +
  geom_sf(aes(fill= q5(transit)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(transit, 'transit')), 0),
                     name = '% of Transit Coverage') +
  labs(title = "Public Transit coverage by Neighborhood",
       fill = "% of Public Transit Coverage") +
    theme(
        legend.text = element_text(size = 4),
        legend.title = element_text(size = 5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 5, face = "italic"),
        plot.title = element_text(size = 10, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))

# Arrange the plots in a 2x2 grid
grid.arrange(Q1, Q2, Q3, Q4, ncol = 2)

```

# Model Building & Evaluation

## Train-Test Split

In this model building process, I spilt the training and testing datasets based on 80% training and 20% testing to ensure the model have a better predictivity

```{r}
# Set the seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
train_index <- createDataPartition(house$price, p = 0.8, list = FALSE)

train <- house[train_index, ]
test <- house[-train_index, ]
```

## Model Building and Evaluation

### Initial model
```{r}
model1 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation, data = train)

test<-test%>%
  mutate(predictions = predict(model1, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
```
### Include the dummy variables
```{r}
model2 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age + ac_dummy + heat_dummy + income + crime + transit + canopy + recreation, data = train)

test<-test%>%
  mutate(predictions = predict(model2, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
```
### Include the Categorical Variables
```{r}
model3 <- lm(logprice ~ halfbaths + heatedarea + numfirepla + bldggrade + units + age+ income + crime + transit + canopy + recreation + type, data = train)

test<-test%>%
  mutate(predictions = predict(model3, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
```
### Final model
```{r}
model4 <- lm(logprice ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)
summary(model4)
```
```{r}
test<-test%>%
  mutate(predictions = predict(model4, newdata = test),
         actual= logprice,
         error= actual - predictions,
         abserror= abs(error),
         ape=(abs(actual-predictions)) / actual)

# Calculate Mean Absolute Error (MAE)
mae <- mean(test$abserror)

# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((test$actual - test$predictions)^2))

# Calculate R-squared for the test set
sse <- sum((test$actual - test$predictions)^2)
sst <- sum((test$actual - mean(test$actual))^2)
r2 <- 1 - sse/sst

# Print the results
results <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", "Test R-squared"),
  Value = c(round(mae, 5), round(rmse, 5), round(r2, 5))
)

# Print results using kable
kable(results, caption = "Model Performance Metrics")
```


### model diagnostics

```{r, fig.width=10, fig.height=10}
par(mfrow = c(2, 2))
plot(model4)
```

**Compare with non log transformed price model**
```{r, fig.width=10, fig.height=10}
model5 <- lm(price ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = train)

par(mfrow = c(2, 2))
plot(model5)
```

# Spatial Analysis & Residual Diagnostics

```{r, warning=FALSE, message=FALSE}
model_final<-lm(logprice ~  heatedarea + numfirepla+ bldggrade + units + age+ income + crime + transit + canopy + type+ ac_dummy + heat_dummy, data = house)

fitted_values <- fitted(model_final)
residuals <- residuals(model_final)

house<- house %>%
  mutate(fitted = fitted_values,
         residuals = residuals)

ggplot(house, aes(x = fitted, y = logprice)) +
  geom_point(color = "black", size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Predicted Values vs. Observed Values",
       x = "Observed Values",
       y = "Predicted Values") +
  theme_light() +
  theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 7),
        axis.title = element_text(size = 8))

```

```{r, fig.width=10, fig.height=10, meassage=FALSE, warning=FALSE}
residual_map<-st_join(income, house)%>%
  group_by(GEOID)%>%
  summarise(residual=mean(residuals))%>%
  filter(!is.na(residual))

ggplot(residual_map) +
  geom_sf(aes(fill= q5(residual)), color='white') +
  scale_fill_manual(values = flatreds5,
                     labels = function(x) round(as.numeric(qBr(residual_map, 'residual')), 6),
                     name = 'Residuals') +
  labs(title = "Residuals Spatial Distribution",
       fill = "Residuals") +
    theme(
        legend.text = element_text(size =8),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 8, face = "italic"),
        plot.title = element_text(size = 20, hjust= 0.5,face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
  
```
```{r, fig.width=10, fig.height=6, message=FALSE, warning=FALSE}
# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(test) 

# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))

# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")

moran <- moran.mc(test$error, spatialWeights, nsim = 999)

ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()

```
```{r, fig.width=10, fig.height=6, message=FALSE, warning=FALSE}
# Extract the coordinates from the spatial dataframe
coords <- st_coordinates(house) 

# Create a neighbor list using k-nearest neighbors (KNN) with k=5
neighborList <- knn2nb(knearneigh(coords, 5))

# Convert the neighbor list to a spatial weights matrix
spatialWeights <- nb2listw(neighborList, style="W")

moran <- moran.mc(house$residuals, spatialWeights, nsim = 999)

ggplot(as.data.frame(moran$res[c(1:999)]), aes(moran$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moran$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  theme_minimal()
```

# Conclusion
